Boosting¶

"Try again. Fail again. Fail better." (Samuel Beckett)

  1. I Contain Multitudes
    1.1 Decision Trees
    1.2 A Song of Bias and Variance
    1.3 Bagging vs Boosting
  2. The Strength of Weak Learners
    2.1 A short trip to Classtown
    2.2 Does it actually work?
  3. The AdaBoost Algorithm
    3.1 Decision Stumps and Decision Boundaries
    3.2 Oops, I did it again: AdaBoost Intuition
    3.3 The Algorithm
  4. Gradient Boosting
    4.1 All I Want for Boosting is Gradient
    4.2 Gradient Boosting Intuition
    4.3 AdaBoost vs Gradient Boosting
  5. AdaBoost in Practice

1. I Contain Multitudes¶

multitudes.png

1.1 Decision Trees¶

decision_tree.png

Define an impurity measure, e.g. the Gini impurity index: $G(N) = \sum_{k}p_k(1-p_k)$, where $p_k$ is the fraction of examples of class $k$ at node $N$.

gini_binary.png

Algorithm time complexity: O(?)

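
The Gini index above is easy to compute directly. A minimal NumPy sketch (the function name and the toy label arrays are illustrative, not from the notebook):

```python
import numpy as np

def gini(labels):
    """Gini impurity G(N) = sum_k p_k (1 - p_k) for the labels at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p_k
    return float(np.sum(p * (1 - p)))

print(gini(np.array([1, 1, 1, 1])))    # pure node -> 0.0
print(gini(np.array([1, 1, -1, -1])))  # maximally mixed binary node -> 0.5
```

A split is then chosen to maximize the impurity decrease between the parent node and the weighted impurities of its children.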

1.2 A Song of Bias and Variance¶

bias_variance.png

  • $Y=f(\mathbf{X}) + \epsilon$: true relationship between the observations $Y$ and the covariates $\mathbf{X}$.
  • $\epsilon$: Gaussian noise with zero mean and standard deviation $\sigma_\epsilon$.
  • $\hat{f}(\mathbf{x})$: the model. $$ Err(x)=E[(Y - \hat{f}(x))^2] $$

$$ Err(x)=(E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + {\sigma_{\epsilon}}^2 $$

$$ Err(x)=\text{Bias}^2 + \text{Variance} + \text{Irreducible Error} $$
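
The decomposition can be checked numerically by refitting a model on many independent noisy training sets drawn from a known truth. Everything below is an assumed toy setup (a sine target, a degree-3 polynomial model, hand-picked constants), not the notebook's own experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                  # assumed true relationship f(x)
sigma_eps = 0.3             # noise standard deviation sigma_epsilon
x0 = 1.0                    # test point at which we decompose the error

def fit_predict(degree, n=30):
    """Fit a polynomial to one freshly drawn noisy sample; predict at x0."""
    x = rng.uniform(0, 2 * np.pi, n)
    y = f(x) + rng.normal(0, sigma_eps, n)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)

# Repeat the fit on 2000 training sets to estimate E[f_hat(x0)] and Var
preds = np.array([fit_predict(degree=3) for _ in range(2000)])
bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, "
      f"irreducible ~ {sigma_eps**2:.4f}")
```

Raising the polynomial degree in this sketch shrinks the bias term while inflating the variance term, which is the trade-off the figure above depicts.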

1.3 Bagging vs Boosting¶

bagging_boosting-3.png

2. The Strength of Weak Learners¶

Hypothesis Boosting Problem (Michael Kearns, 1988)

Does the existence of an efficient learning algorithm that outputs a hypothesis whose performance is only slightly better than random guessing imply the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy?

weak_to_strong_learner.webp

2.1 A Short Trip to Classtown¶

  • Destination: $\{(x_1, 1), (x_2, 1), (x_3, -1)\}$
  • labels form a point in $\mathbb{R}^3$
  • movement hypothesis $h$: $(h(x_1), h(x_2), h(x_3))$
  • total movement: $H=\sum_t \alpha_t h_t(x)$

function_space.png

2.2 Does it actually work?¶

boosting_intuition.gif

3.1 AdaBoost: Decision Stumps and Decision Boundaries¶

Yes, a weak learner can be boosted into a strong one (Schapire, 1990).

Decision stump: a decision tree with only one split.
Decision boundary: the curve (a straight line in the simplest case) separating classes in a classification problem.

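
A decision stump can be fit by brute force: try every (feature, threshold, polarity) combination and keep the one with the lowest weighted error. A minimal sketch, assuming labels in $\{-1, +1\}$ and a hypothetical toy dataset:

```python
import numpy as np

def fit_stump(X, y, weights):
    """Exhaustive search for the one-split stump (feature, threshold,
    polarity) minimizing the weighted 0/1 error; y is in {-1, +1}."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = polarity * np.where(X[:, j] <= thr, 1, -1)
                err = weights[pred != y].sum()   # weighted misclassification
                if err < best_err:
                    best_err, best_params = err, (j, thr, polarity)
    return best_err, best_params

def stump_predict(params, X):
    j, thr, polarity = params
    return polarity * np.where(X[:, j] <= thr, 1, -1)

# Toy data: a single split at x <= 1 separates the two classes perfectly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
err, params = fit_stump(X, y, np.full(4, 0.25))
```

The resulting decision boundary is always axis-aligned, which is exactly why a single stump is such a weak learner.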

3.2 Oops, I did it again: AdaBoost Intuition¶

  1. Train a decision stump on the weighted training examples, so that examples with higher weights are prioritized.
  2. Update the weights of the training examples such that misclassified examples are assigned higher weights; the worse the error, the higher the weight.

adaboost_iteration0.png

adaboost_iteration1.png

adaboost_iteration2.png

adaboost_ensemble.png

3.3 AdaBoost: the Algorithm¶

  1. Train a weak learner $h_t(x)$ on the weighted training examples $(x_i, y_i, D_i)$

    • Compute the weighted training error of $h_t(x)$: $\epsilon_t=\sum_{i:y_i\neq h_t(x_i)}D_i$
    • Compute the weight of the weak learner from its error: $\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
  2. Update the weights of the training examples

    • $D_i \leftarrow D_i\,e^{-\alpha_t y_i h_t(x_i)}$, then renormalize so that $\sum_i D_i = 1$

How do the weights change for correctly classified examples? And for misclassified ones?

The overall classifier after $T$ iterations is just a weighted ensemble:

$$ H(x)= \sum_{t=1}^T \alpha_t h_t (x). $$

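
The two steps above fit in a few dozen lines of NumPy. This is a minimal sketch of AdaBoost with decision stumps as the weak learners (the toy dataset and the `1e-10` guard are assumptions for illustration, not from the notebook):

```python
import numpy as np

def fit_stump(X, y, D):
    """Weak learner: the one-split stump (feature, threshold, polarity)
    minimizing the weighted 0/1 error; y is in {-1, +1}."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thr, pol)
    return best_err, best

def stump_predict(params, X):
    j, thr, pol = params
    return pol * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, T=10):
    n = len(y)
    D = np.full(n, 1.0 / n)               # start from uniform example weights
    ensemble = []
    for _ in range(T):
        eps, params = fit_stump(X, y, D)  # weighted training error eps_t
        eps = max(eps, 1e-10)             # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(params, X)
        D *= np.exp(-alpha * y * pred)    # misclassified points get heavier
        D /= D.sum()                      # renormalize to a distribution
        ensemble.append((alpha, params))
    return ensemble

def predict(ensemble, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    H = sum(alpha * stump_predict(p, X) for alpha, p in ensemble)
    return np.sign(H)
```

Note how the update $D_i e^{-\alpha_t y_i h_t(x_i)}$ answers the question above: correctly classified points ($y_i h_t(x_i)=+1$) are multiplied by $e^{-\alpha_t}<1$, misclassified ones by $e^{\alpha_t}>1$.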

4.1 All I want for Boosting is Gradient¶

We use the Branin function as a test function to visualize gradient descent. It is a function of two variables $w_1$ and $w_2$:

$$ f(w_1, w_2) = a (w_2 - b w_1^2 + c w_1 - r)^2 + s (1-t) \cos{w_1} + s $$

where $a = 1$, $b = 5.1/(4\pi^2)$, $c = 5/\pi$, $r = 6$, $s = 10$, and $t = 1/(8\pi)$ are fixed constants.

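
Running plain gradient descent on this function is straightforward; the gradient follows from differentiating the formula above. The starting point and step size below are hand-picked assumptions, not tuned values:

```python
import numpy as np

# Branin constants from the text
a, b, c, r, s, t = 1.0, 5.1 / (4 * np.pi**2), 5 / np.pi, 6.0, 10.0, 1 / (8 * np.pi)

def branin(w):
    w1, w2 = w
    return a * (w2 - b * w1**2 + c * w1 - r)**2 + s * (1 - t) * np.cos(w1) + s

def branin_grad(w):
    """Analytic gradient of the Branin function."""
    w1, w2 = w
    inner = w2 - b * w1**2 + c * w1 - r
    dw1 = 2 * a * inner * (-2 * b * w1 + c) - s * (1 - t) * np.sin(w1)
    dw2 = 2 * a * inner
    return np.array([dw1, dw2])

# Plain gradient descent with a fixed step size (an assumption, not tuned)
w = np.array([2.0, 2.0])
for _ in range(2000):
    w -= 0.01 * branin_grad(w)
print(branin(w))   # should approach the global minimum value, ~0.3979
```

From this starting point the iterates slide into the basin of the global minimizer near $(\pi, 2.275)$; a different start can land in one of the other two global minima.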

4.2 Gradient Boosting Intuition¶

  • Loss function $L$
  • Model at step $t$: $H_t(x)=H_{t-1}(x) + \alpha_t h_t (x)$
  • Choose $h_t(x)$ to be the closest to the direction of the negative gradient of $L$: $-\frac{\partial L}{\partial H}\big|_{H=H_{t-1}(x)}$
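
For the squared loss $L=\frac{1}{2}(y-H)^2$, the negative gradient is simply the residual $y - H_{t-1}(x)$, so each round fits the next weak learner to the current residuals. A minimal sketch with one-dimensional regression stumps (the sine target, shrinkage rate, and function names are illustrative assumptions):

```python
import numpy as np

def fit_reg_stump(x, residual):
    """One-split regression stump on 1-D inputs: pick the threshold that
    minimizes squared error, predicting the mean residual on each side."""
    best_sse, best = np.inf, None
    for thr in np.unique(x):
        left, right = residual[x <= thr], residual[x > thr]
        lm = left.mean()
        rm = right.mean() if len(right) else 0.0
        sse = ((left - lm) ** 2).sum() + ((right - rm) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (thr, lm, rm)
    return best

def stump_out(params, x):
    thr, lm, rm = params
    return np.where(x <= thr, lm, rm)

def gradient_boost(x, y, T=200, lr=0.1):
    H = np.zeros_like(y, dtype=float)
    stumps = []
    for _ in range(T):
        residual = y - H               # negative gradient of 1/2 (y - H)^2
        params = fit_reg_stump(x, residual)
        H += lr * stump_out(params, x)  # alpha_t fixed to lr (shrinkage)
        stumps.append(params)
    return stumps, H

x = np.linspace(0, 2 * np.pi, 40)
y = np.sin(x)
_, H = gradient_boost(x, y, T=200, lr=0.1)
print(np.mean((y - H) ** 2))   # training MSE shrinks as T grows
```

Swapping in a different loss (absolute error, log-loss) only changes which "residual" the weak learner is fit to, which is what makes gradient boosting more general than AdaBoost.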

boosting_vs_gradient_boosting.png

4.3 AdaBoost vs Gradient Boosting¶

adaboost_vs_gradient_boosting.png